Claims

[c1] 1. A method for representing the latent semantic content of a plurality of
documents, each document containing a plurality of terms, the method
comprising:
deriving at least one n-tuple term from the plurality of terms;
forming a two-dimensional matrix,
each matrix column c corresponding to a document,
each matrix row r corresponding to a term occurring in at least one document
corresponding to a matrix column,
each matrix element (r, c) related to the number of occurrences of the term
corresponding to the row r in the document corresponding to column c,
at least one matrix element related to the number of occurrences of at least
one n-tuple term occurring in the at least one document, and
performing singular value decomposition and dimensionality reduction on the
matrix to form a latent semantic indexed vector space.
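
As an illustration of the matrix-forming steps recited above, the following is a minimal sketch, assuming whitespace tokenization, lowercasing, and n = 2. The function names `terms_with_ntuples` and `term_document_matrix` are hypothetical and do not come from the claims. The singular value decomposition and dimensionality reduction step is not shown; it would be applied to the resulting matrix with a linear algebra library.

```python
from collections import Counter


def terms_with_ntuples(tokens, n=2):
    """Return the verbatim terms plus every n-tuple of n consecutive terms."""
    ntuples = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + ntuples


def term_document_matrix(documents, n=2):
    """Build (row_terms, matrix) where matrix[r][c] counts term r in document c."""
    # One Counter per document (one matrix column per document).
    counts = [Counter(terms_with_ntuples(doc.lower().split(), n)) for doc in documents]
    # One row per term or n-tuple occurring in at least one document.
    row_terms = sorted(set().union(*counts))
    matrix = [[column[term] for column in counts] for term in row_terms]
    return row_terms, matrix


docs = ["latent semantic indexing", "semantic indexing of documents"]
rows, matrix = term_document_matrix(docs)
```

In this sketch the row for the 2-tuple "semantic indexing" sits alongside the rows for the single terms, so the n-tuple contributes its own occurrence counts to the matrix.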

[c2] 2. The invention as recited in Claim 1 further comprising:
identifying an occurrence threshold;
wherein n-tuples that appear less times in the document collection than the
occurrence threshold are not included as elements of the matrix.
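
The occurrence threshold can be sketched as a filter applied before the matrix is formed. This is only an illustration, assuming whitespace tokenization and counts taken over the whole collection; the name `frequent_ntuples` is hypothetical.

```python
from collections import Counter


def frequent_ntuples(documents, n=2, threshold=2):
    """Return the n-tuples whose collection-wide count is at least the threshold."""
    totals = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        totals.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # n-tuples appearing fewer times than the threshold are dropped.
    return {ntuple for ntuple, count in totals.items() if count >= threshold}


kept = frequent_ntuples(["semantic indexing works", "semantic indexing scales"])
```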

[c3] 3. The invention as recited in Claim 1 wherein the occurrence threshold is two.

[c4] 4. The invention as recited in Claim 1 wherein deriving at least one n-tuple term
further comprises:
creating the at least one n-tuple term from n consecutive verbatim terms.

[c5] 5. A method for determining conceptual similarity between a subject document
and at least one of a plurality of reference documents, each document
containing a plurality of terms, the method comprising:
deriving at least one n-tuple term from the plurality of terms,
forming a plurality of two-dimensional matrices wherein, for each matrix:
each matrix column c corresponds to a document, one column corresponding to
the subject document;
each matrix row r corresponds to a term occurring in at least one document
corresponding to a matrix column,
each matrix element (r, c) represents the number of occurrences of the term
corresponding to r in the document corresponding to c;
performing singular value decomposition and dimensionality reduction on a
plurality of formed matrices, to form a plurality of latent semantic indexed
vector spaces,
the latent semantic indexed vector spaces including at least one space formed
from a matrix including at least one element corresponding to the number of
occurrences of at least one n-tuple term in at least one document,
determining at least one composite similarity measure between the subject
document and at least one reference document as a function of a weighted
similarity measure of the subject document to the reference document in each
of a plurality of indexed vector spaces.

[c6] 6. The method as recited in Claim 5 wherein the similarity measures from vector
spaces comprising greater numbers of n-tuples are weighted greater than
similarity measures from vector spaces comprising lesser number of n-tuples.
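
One way to picture the composite similarity measure is as a weighted average of per-space similarities. The sketch below is an assumption, not the claimed method itself: the claims do not name cosine similarity or a weighted average, and `composite_similarity`, the example vectors, and the weights are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def composite_similarity(subject_vectors, reference_vectors, weights):
    """Weighted average of the similarity measured in each vector space."""
    weighted = sum(
        w * cosine(s, r)
        for s, r, w in zip(subject_vectors, reference_vectors, weights)
    )
    return weighted / sum(weights)


# Two hypothetical spaces; the second is assumed to contain more n-tuples,
# so it receives the larger weight.
subject = [[1.0, 0.0], [1.0, 1.0]]
reference = [[0.0, 1.0], [1.0, 1.0]]
score = composite_similarity(subject, reference, weights=[1.0, 3.0])
```

Here the documents are orthogonal in the first space and identical in direction in the second, so the composite score is pulled toward the second space by its larger weight.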

[c7] 7. A method for representing a query document, the query document containing
verbatim terms, the query document intended for querying a collection of
reference documents via a latent semantic indexed representation of the
reference collection; the method comprising:
identifying verbatim entities;
stemming identified entities;
generalizing stemmed entities; and
supplementing verbatim entities with corresponding generalized entities.



[c8] 



8. A method for representing a query document, the query document containing 
verbatim terms, the query document intended for querying a collection of 
reference documents via a latent semantic indexed representation of the 
reference collection; the method comprising: 
identifying verbatim entities; 
stemming identified entities; 
generalizing stemmed entities; and 
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replacing verbatim entities with corresponding generalized entities. 
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9. The method as recited in Claim 8 wherein verbatim entities comprise ordered 
terma between stop words. 
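
Identifying verbatim entities as ordered terms between stop words might look like the following minimal sketch. The stop-word list, the whitespace tokenization, and the name `verbatim_entities` are assumptions for illustration only; stemming and generalization are not shown.

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to"}  # hypothetical list


def verbatim_entities(text, stop_words=STOP_WORDS):
    """Return the runs of consecutive non-stop-word terms, in document order."""
    entities, current = [], []
    for token in text.lower().split():
        if token in stop_words:
            # A stop word closes the entity collected so far, if any.
            if current:
                entities.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        entities.append(" ".join(current))
    return entities


entities = verbatim_entities("The indexing of patent claims for the search engine")
```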

10. The method as recited in Claim 8 wherein generalizing entities further
comprises alphabetically ordering stemmed entities as an aid to generalization.

11. The method as recited in Claim 8 wherein generalizing entities further
comprises ordering stemmed entities as a function of the frequency of
occurrence of verbatim entities.

12. The method as recited in Claim 8 wherein generalized entities are identified
with human feedback.

13. The method as recited in Claim 8 wherein generalized entities are identified
by automated process.

14. A method for characterizing the results of a query into a latent-semantic-
indexed document space, the query comprising at least one term, the results
comprising a set of document identities; the method
comprising:
ranking results as a function of at least the frequency of occurrence of at least
one term.

15. The method as recited in Claim 14 wherein at least one term used in
ranking is a query term.

16. The method as recited in Claim 15 wherein the at least one query term used
in ranking is a generalized entity.

17. The method as recited in Claim 14 wherein the at least one term used in
ranking is a generalized entity.
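
Ranking a result set by term frequency can be sketched as below. This is a simplified illustration, assuming a single-word ranking term, whitespace tokenization, and an in-memory mapping from document identity to text; `rank_by_term_frequency` and the sample documents are hypothetical.

```python
def rank_by_term_frequency(results, documents, term):
    """Order document identities by how often `term` occurs, highest first."""
    def frequency(doc_id):
        return documents[doc_id].lower().split().count(term.lower())

    return sorted(results, key=frequency, reverse=True)


documents = {
    "d1": "matrix methods for text",
    "d2": "matrix matrix decomposition of a matrix",
    "d3": "sparse matrix and dense matrix",
}
ranked = rank_by_term_frequency(["d1", "d2", "d3"], documents, "matrix")
```

The ranking term here could be a query term or a generalized entity; the sketch does not distinguish between the two.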

18. A method for determining conceptual similarity between a query document
and at least one of a plurality of reference documents, each document
comprising a plurality of verbatim terms, the reference documents indexed into
a latent semantic index space, the method comprising:
identifying verbatim entities;
stemming identified entities;
generalizing stemmed entities;
replacing at least one verbatim entity with the corresponding generalized entity
to form a generalized query;
identifying a set of reference documents based on closeness, within the latent
semantic indexed space, between the generalized query and each reference
document; and
ranking a subset of closest identified documents as a function of at least the
frequency of occurrence of at least one term.

[c19] 19. The method as recited in Claim 18 wherein at least one term used in
ranking is a query term.

[c20] 20. The method as recited in Claim 19 wherein the at least one query term used
in ranking is a generalized entity.

[c21] 21. The method as recited in Claim 18 wherein the at least one term used in
ranking is a generalized entity.



[c22] 22. A method for representing the latent semantic content of a plurality of
documents, each document containing a plurality of verbatim terms, the
method comprising:
deriving at least one expansion phrase from the verbatim terms,
each expansion phrase comprising terms;
replacing at least one occurrence of a verbatim term having an expansion
phrase with the expansion phrase corresponding to that verbatim term;
forming a two-dimensional matrix,
each matrix column c corresponding to a document;
each matrix row r corresponding to a term;
each matrix element (r, c) representing the number of occurrences of the term
corresponding to r in the document corresponding to c;
at least one matrix element corresponding to the number of occurrences of
at least one term occurring in the at least one expansion phrase, and
performing singular value decomposition and dimensionality reduction on the
matrix to form a latent semantic indexed vector space.

23. A method for representing the latent semantic content of a plurality of
documents, each document containing a plurality of terms, the method
comprising:
identifying at least one idiom among the documents,
each idiom containing at least one idiom term;
forming a two-dimensional matrix,
each matrix column corresponding to a document;
each matrix row corresponding to a term occurring in at least one document
represented by a row;
each matrix element representing the number of occurrences of the term
corresponding to the element's row in the document corresponding to element's
column;
at least one occurrence of at least one idiom term being excluded from the
number of occurrences corresponding to that term in the matrix,
performing singular value decomposition and dimensionality reduction on the
matrix.



[c24] 24. A method for representing the latent semantic content of a plurality of
documents, each document containing a plurality of terms, the method
comprising:
identifying at least one idiom among the documents,
each idiom containing at least one idiom term;
replacing at least one identified idiom with a corresponding idiom elaboration,
each elaboration comprising at least one elaboration term,
forming a two-dimensional matrix,
each matrix column corresponding to a document;
each matrix row corresponding to a term;
each matrix element representing the number of occurrences of the term
corresponding to the element's row in the document corresponding to element's
column,
at least one matrix element corresponding to the number of occurrences of an
elaboration term in a document corresponding to a matrix column;
performing singular value decomposition and dimensionality reduction on the
matrix.
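
The idiom-replacement step can be illustrated with a minimal sketch, assuming a small hand-built idiom table and plain substring replacement on lowercased text. `IDIOM_ELABORATIONS`, `elaborate_idioms`, and `term_counts` are hypothetical names; substring matching ignores word boundaries and is a simplification.

```python
from collections import Counter

IDIOM_ELABORATIONS = {"kick the bucket": "die"}  # hypothetical idiom table


def elaborate_idioms(text, elaborations=IDIOM_ELABORATIONS):
    """Replace each known idiom with its elaboration before terms are counted."""
    text = text.lower()
    for idiom, elaboration in elaborations.items():
        text = text.replace(idiom, elaboration)
    return text


def term_counts(text):
    """Count terms for one matrix column after idiom replacement."""
    return Counter(elaborate_idioms(text).split())


counts = term_counts("The old pump will kick the bucket soon")
```

After replacement the idiom terms "kick" and "bucket" no longer contribute to the counts for this document, while the elaboration term "die" does.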


